# Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


<hr>

# Taking care of Missing Data

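* `SimpleImputer` is a class that provides basic strategies for imputing missing values, using the mean, the median, or the most frequent value of the column in which each missing value occurs. Here it is fitted on the two numeric columns (indices 1 and 2), so each `nan` is replaced by the mean of its own column.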

In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
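
As a quick check, the imputed values should equal the column means computed over the non-missing entries. The sketch below recomputes them with `np.nanmean`. The names `ages` and `salaries` are illustrative labels for columns 1 and 2 of the printed data, not names taken from the dataset.

```python
import numpy as np

# Columns 1 and 2 of X as printed before imputation (nan marks a missing entry).
# The variable names are assumptions used only for this illustration.
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salaries = np.array(
    [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
    dtype=float,
)

# np.nanmean ignores nan entries, which mirrors what strategy="mean" computes.
age_mean = np.nanmean(ages)
salary_mean = np.nanmean(salaries)

print(age_mean)     # approximately 38.78
print(salary_mean)  # approximately 63777.78
```

Both values match the entries that replaced `nan` in the printed array.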


<hr>

# Categorical Data
* Categorical data represents categories or labels rather than numerical quantities.
* In the code cells below, the first column of the dataset (the country name) is categorical and is one-hot encoded using `OneHotEncoder`.

# Encoding Categorical Data

## Independent Variable
* In machine learning, the independent variables (`X`) are the input features used to predict the target (dependent) variable.
* In the code cells below, the `X` array holds the independent variables, one column per feature.

## Encoding the Independent Variable

* `ColumnTransformer` is a utility class that allows you to apply different preprocessing steps to different columns of a dataset.

* `OneHotEncoder` is a utility class that converts a categorical column into one binary (0/1) column per category, with exactly one 1 in each row.

* The `transformers` parameter is a list of tuples, where each tuple defines a name, a transformer (in this case, an instance of `OneHotEncoder`), and a list of columns to which the transformer should be applied (in this case, just the first column, since Python uses 0-based indexing). The `remainder='passthrough'` parameter means that any columns not specified in the `transformers` list will be left unchanged.

* The `fit_transform` method first fits the transformer to the data (i.e., it learns any parameters needed, such as the categories in the case of `OneHotEncoder`), and then transforms the data. The result is then converted to a NumPy array and reassigned to `X`.

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
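
To see which new column corresponds to which category, the fitted encoder can be inspected. The sketch below is a minimal illustration on a three-row toy array; `X_demo` and `ct_demo` are hypothetical names, and the lookup through `named_transformers_` uses the `'encoder'` name chosen in the `transformers` list.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (an assumption for illustration): one categorical and one numeric column.
X_demo = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0]], dtype=object
)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough"
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# categories_ holds the learned categories in sorted order; the i-th category
# becomes the i-th one-hot column, placed before the passthrough columns.
categories = ct_demo.named_transformers_["encoder"].categories_[0]

print(categories)
print(encoded)
```

Because the categories are sorted, the first three columns in the printed output above stand for France, Germany, and Spain, in that order.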


## Dependent Variable
* The dependent variable (`y`) is the output that the machine learning model predicts from the independent variables.
* In the code cells below, the `y` array holds the dependent variable, that is, the target column of the dataset.

## Encoding the Dependent Variable

* `LabelEncoder` is a utility class that encodes target labels as integers from 0 to `n_classes - 1`. Here `'No'` becomes 0 and `'Yes'` becomes 1.

In [9]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
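
The mapping between labels and integer codes can be read from the fitted encoder and reversed when predictions need to be reported as labels. This is a minimal sketch on a short toy array; `y_demo` and `le_demo` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels (an assumption for illustration).
y_demo = np.array(["No", "Yes", "No", "No", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

# classes_ lists the labels in sorted order; a label's position is its code.
print(le_demo.classes_)
print(y_encoded)

# inverse_transform maps integer codes back to the original labels.
restored = le_demo.inverse_transform(y_encoded)
print(restored)
```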


<hr>

# Splitting the dataset into the Training set and Test set

## Training Set
* The training set is the subset of the dataset used to fit the machine learning model.
* In the code cells below, `X_train` and `y_train` hold the training set: 80% of the data, which is 8 of the 10 rows here.

## Testing Set
* The testing set is the subset of the dataset held out to evaluate the trained machine learning model.
* In the code cells below, `X_test` and `y_test` hold the testing set: 20% of the data, which is 2 of the 10 rows here. Setting `random_state=1` makes the shuffle, and therefore the split, reproducible.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [12]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [13]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [14]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [15]:
print(y_test)

[0 1]
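
The effect of `test_size` and `random_state` can be checked on a small synthetic array. The sketch below uses made-up data (`X_demo`, `y_demo`) rather than the dataset above. It shows that a 10-row input splits into 8 and 2 rows, and that repeating the call with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
# A second call with the same seed should reproduce the same rows.
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(a_train.shape, a_test.shape)
same_split = np.array_equal(a_test, a_test2) and np.array_equal(b_train, b_train2)
print(same_split)
```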


<hr>

# Feature Scaling
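* *Feature scaling* is the process of standardizing or normalizing the numerical features of a dataset so that they share a similar scale.
* In the code cells below, `StandardScaler` standardizes the numerical columns (index 3 onward) by subtracting the mean and dividing by the standard deviation, so that each has zero mean and unit variance on the training set. The one-hot columns are left as they are. The mean and standard deviation should be learned from the training set only and then reused on the test set, so that the test data does not influence preprocessing. Scaling keeps features with large ranges from dominating features with small ranges in scale-sensitive models.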

In [16]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Scale only the numeric columns (index 3 onward); the one-hot columns stay as 0/1.
X_train[:, 3:] = ss.fit_transform(X_train[:, 3:])
# Reuse the mean and standard deviation learned from the training set.
X_test[:, 3:] = ss.transform(X_test[:, 3:])

In [17]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [18]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
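
Standardization computes `z = (x - mean) / std` for each column. The sketch below is a minimal illustration on made-up numbers (`train` and `test` are hypothetical arrays). It fits the scaler on the training rows, applies it to the test rows, and reproduces the result by hand to show that the training statistics are the ones being reused.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (an assumption for illustration).
train = np.array([[27.0], [35.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Same computation by hand. np.std defaults to the population standard
# deviation (ddof=0), which is what StandardScaler uses.
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(test_scaled.ravel())
print(manual_test.ravel())
```

Calling `fit_transform` on a two-row test set instead would always produce -1 and 1 in each column, whatever the raw values, because the two points are then centered on their own mean and divided by their own standard deviation.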


<hr>